
Bioinformatics pipeline summary


An overview of the pipeline's processing steps

Author: Adrien Taudière

Date: December 3, 2024

Summary of the bioinformatics pipeline

Code
library(knitr)
library(targets)
library(MiscMetabar)
here::i_am("analysis/01_bioinformatics.qmd")
source(here::here("R/styles.R"))
source(here::here("R/functions.R"))

Timeline and CPU usage

Code
log_data <- autometric::log_read(
  here::here("data/data_final/autometric_log.txt"),
  units_time = "hours",
  units_memory = "gigabytes"
)

# Drop the "conclude:", "prepare:" and "__DEFAULT__" phases before plotting
log_data |>
  filter(!grepl("conclude:", phase)) |>
  filter(!grepl("prepare:", phase)) |>
  filter(!grepl("__DEFAULT__", phase)) |>
  # One line per phase: color encodes resident memory, line width encodes CPU usage
  ggplot(aes(x = time, y = reorder(phase, desc(time)), color = resident)) +
  geom_line(aes(linewidth = cpu)) +
  scale_color_viridis_b("Memory (GB)", end = 0.9, direction = -1) +
  theme_idest() +
  xlab("Time (in hours)") +
  ylab("")

Main phyloseq object

You may want to select another target, such as d_asv or d_vs_mumu.

Code
d_pq <- clean_pq(tar_read("d_vs", store = here::here("_targets/")))
Cleaning suppress 0 taxa and 26 samples.
Code
summary_plot_pq(d_pq)
Cleaning suppress 0 taxa and 0 samples.

Code
tar_glimpse(
  script = here::here("_targets.R"),
  targets_only = TRUE,
  callr_arguments = list(show = FALSE)
)
Code
tar_meta(store = here::here("_targets/"), targets_only = TRUE) |>
  # Run time as hours:minutes:seconds; minutes are taken from what remains after whole hours
  dplyr::mutate(time = paste0(seconds %/% 3600, ":", (seconds %% 3600) %/% 60, ":", floor(seconds %% 60))) |>
  dplyr::select(name, seconds, bytes, format, time) |>
  dplyr::mutate(GB = round(bytes / 10^9, 2)) |>
  dplyr::arrange(desc(seconds), desc(bytes)) |>
  kable()
name seconds bytes format time GB
d_asv 15175.810 1121258 rds 4:12:55 0.00
tax_tab 5874.178 810117 rds 1:37:54 0.00
ddF 1706.088 134739992 qs 0:28:26 0.13
ddR 1416.051 146350364 qs 0:23:36 0.15
filtered 642.999 166 rds 0:10:42 0.00
quality_raw_seq 577.204 19103 rds 0:9:37 0.00
quality_seq_wo_primers 502.534 18972 rds 0:8:22 0.00
cutadapt 325.437 69632 file 0:5:25 0.00
quality_seq_filtered_trimmed_FW 255.565 11354 rds 0:4:15 0.00
quality_seq_filtered_trimmed_REV 255.029 11354 rds 0:4:15 0.00
err_fs 241.244 21768 qs 0:4:1 0.00
err_rs 231.799 24521 qs 0:3:51 0.00
derep_rs 211.368 2780860228 qs 0:3:31 2.78
derep_fs 183.569 1764068735 qs 0:3:3 1.76
track_sequences_samples_clusters 112.782 422 rds 0:1:52 0.00
merged_seq 101.611 1440062 qs 0:1:41 0.00
track_by_samples 94.841 9016 rds 0:1:34 0.00
bioinfo_report 47.120 44 rds 0:0:47 0.00
seqtab_wo_chimera 15.211 818071 rds 0:0:15 0.00
d_vs 5.692 918040 rds 0:0:5 0.00
d_vs_mumu 4.358 902295 rds 0:0:4 0.00
d_vs_mumu_rarefy 0.349 868910 rds 0:0:0 0.00
seq_tab_Pairs 0.259 960195 rds 0:0:0 0.00
data_phyloseq 0.091 1039190 rds 0:0:0 0.00
s_d 0.045 10846 rds 0:0:0 0.00
seqtab 0.011 811557 rds 0:0:0 0.00
asv_tab 0.011 811434 rds 0:0:0 0.00
data_raw 0.010 5380 rds 0:0:0 0.00
sam_tab 0.004 3577 rds 0:0:0 0.00
fastq_files_folder 0.002 45056 file 0:0:0 0.00
file_sam_data_csv 0.001 3858 file 0:0:0 0.00
data_fnfs 0.001 2763 rds 0:0:0 0.00
samp_n_otu_table 0.001 1996 rds 0:0:0 0.00
file_refseq_taxo 0.000 114270891 file 0:0:0 0.11
data_fnrs 0.000 2763 rds 0:0:0 0.00
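
The time column above is meant to be an hours:minutes:seconds rendering of the seconds column. As a minimal sketch, the conversion could be isolated in a small helper; `format_hms()` is a hypothetical name, not a function from the pipeline or from {targets}. The minutes are taken from the remainder after removing whole hours, and minutes and seconds are zero-padded.

```r
# Hypothetical helper (not part of the pipeline): format durations given in
# seconds as H:MM:SS strings.
format_hms <- function(seconds) {
  hours <- seconds %/% 3600               # whole hours
  minutes <- (seconds %% 3600) %/% 60     # whole minutes left after the hours
  secs <- floor(seconds %% 60)            # whole seconds left after the minutes
  sprintf("%d:%02d:%02d", as.integer(hours), as.integer(minutes), as.integer(secs))
}

# Run times of d_asv, tax_tab and err_fs from the table above
format_hms(c(15175.810, 5874.178, 241.244))
# "4:12:55" "1:37:54" "0:04:01"
```

Such a helper could be called inside dplyr::mutate() in place of the inline paste0() expression.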

Load phyloseq object from targets store

Code
d_pq <- tar_read("d_vs", store = here::here("_targets/"))

The {targets} package is at the core of this project. If you are not familiar with {targets}, please read the introduction of its user manual.

The {targets} package stores … targets in a folder and can load (tar_load()) or read (tar_read()) objects from this folder.
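
To illustrate the difference between the two functions, here is a minimal sketch on a toy pipeline. The targets `x` and `y` are hypothetical and unrelated to this project; tar_dir() runs everything in a temporary directory, so the project's own _targets/ store is not touched.

```r
library(targets)

# Toy pipeline, for illustration only: `x` and `y` are made-up targets.
result <- tar_dir({
  # Write a temporary _targets.R defining two targets
  tar_script(
    list(
      tar_target(x, 1 + 1),
      tar_target(y, x * 10)
    ),
    ask = FALSE
  )
  # Build the targets and save them in the temporary store
  tar_make(reporter = "silent")

  y_value <- tar_read(y) # tar_read() returns the stored value
  tar_load(x)            # tar_load() creates a variable named `x`

  c(x = x, y = y_value)
})

result
```

In short, tar_read() suits assignments such as `d_pq <- tar_read(...)`, while tar_load() is convenient for bringing several targets into the session under their own names.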

Sample data

Code
DT::datatable(d_pq@sam_data)

Sequences, samples and clusters across the pipeline

Make Krona HTML files

Code
krona(clean_pq(d_pq, simplify_taxo = TRUE), file = here::here("data/data_final/krona_nb_seq.html"))
Cleaning suppress 0 taxa and 26 samples.
Code
krona(clean_pq(d_pq, simplify_taxo = TRUE), nb_seq = FALSE, file = here::here("data/data_final/krona_nb_taxa.html"))
Cleaning suppress 0 taxa and 26 samples.

Session Information

Session information is detailed below. More information about the machine, the system, and the Python and R packages is available in the file data_final/information_run.txt.

Code
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] MiscMetabar_0.10.1 purrr_1.0.2        dplyr_1.1.4        dada2_1.34.0      
[5] Rcpp_1.0.13-1      ggplot2_3.5.1      phyloseq_1.50.0    targets_1.9.0     
[9] knitr_1.49        

loaded via a namespace (and not attached):
  [1] bitops_1.0-9                deldir_2.0-4               
  [3] permute_0.9-7               rlang_1.1.4                
  [5] magrittr_2.0.3              ade4_1.7-22                
  [7] matrixStats_1.4.1           compiler_4.4.2             
  [9] mgcv_1.9-1                  png_0.1-8                  
 [11] callr_3.7.6                 vctrs_0.6.5                
 [13] reshape2_1.4.4              autometric_0.1.2           
 [15] stringr_1.5.1               pwalign_1.2.0              
 [17] pkgconfig_2.0.3             crayon_1.5.3               
 [19] fastmap_1.2.0               backports_1.5.0            
 [21] XVector_0.46.0              labeling_0.4.3             
 [23] utf8_1.2.4                  Rsamtools_2.22.0           
 [25] rmarkdown_2.29              UCSC.utils_1.2.0           
 [27] ps_1.8.1                    xfun_0.49                  
 [29] cachem_1.1.0                zlibbioc_1.52.0            
 [31] GenomeInfoDb_1.42.1         jsonlite_1.8.9             
 [33] biomformat_1.34.0           rhdf5filters_1.18.0        
 [35] DelayedArray_0.32.0         Rhdf5lib_1.28.0            
 [37] BiocParallel_1.40.0         jpeg_0.1-10                
 [39] parallel_4.4.2              cluster_2.1.6              
 [41] R6_2.5.1                    bslib_0.8.0                
 [43] RColorBrewer_1.1-3          stringi_1.8.4              
 [45] jquerylib_0.1.4             GenomicRanges_1.58.0       
 [47] SummarizedExperiment_1.36.0 iterators_1.0.14           
 [49] IRanges_2.40.0              Matrix_1.7-1               
 [51] splines_4.4.2               igraph_2.1.1               
 [53] tidyselect_1.2.1            rstudioapi_0.17.1          
 [55] abind_1.4-8                 yaml_2.3.10                
 [57] vegan_2.6-8                 codetools_0.2-20           
 [59] hwriter_1.3.2.1             processx_3.8.4             
 [61] lattice_0.22-6              tibble_3.2.1               
 [63] plyr_1.8.9                  Biobase_2.66.0             
 [65] withr_3.0.2                 ShortRead_1.64.0           
 [67] evaluate_1.0.1              survival_3.7-0             
 [69] RcppParallel_5.1.9          Biostrings_2.74.0          
 [71] pillar_1.9.0                BiocManager_1.30.25        
 [73] MatrixGenerics_1.18.0       DT_0.33                    
 [75] renv_1.0.11                 foreach_1.5.2              
 [77] stats4_4.4.2                generics_0.1.3             
 [79] rprojroot_2.0.4             S4Vectors_0.44.0           
 [81] munsell_0.5.1               scales_1.3.0               
 [83] base64url_1.4               glue_1.8.0                 
 [85] tools_4.4.2                 interp_1.1-6               
 [87] data.table_1.16.2           GenomicAlignments_1.42.0   
 [89] visNetwork_2.1.2            rhdf5_2.50.0               
 [91] grid_4.4.2                  ape_5.8                    
 [93] crosstalk_1.2.1             latticeExtra_0.6-30        
 [95] colorspace_2.1-1            nlme_3.1-166               
 [97] GenomeInfoDbData_1.2.13     cli_3.6.3                  
 [99] fansi_1.0.6                 viridisLite_0.4.2          
[101] S4Arrays_1.6.0              gtable_0.3.6               
[103] sass_0.4.9                  digest_0.6.37              
[105] BiocGenerics_0.52.0         SparseArray_1.6.0          
[107] farver_2.1.2                htmlwidgets_1.6.4          
[109] htmltools_0.5.8.1           multtest_2.62.0            
[111] lifecycle_1.0.4             here_1.0.1                 
[113] httr_1.4.7                  secretbase_1.0.3           
[115] MASS_7.3-61                
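
This report does not show how data_final/information_run.txt is generated. As a rough sketch only, the R part of such a record could be captured with base R as follows; the temporary output file is an assumption made for illustration and is not the path used by the pipeline.

```r
# Illustrative only: save the printed session information to a text file.
# `info_file` is a temporary path, not the pipeline's information_run.txt.
info_file <- tempfile(fileext = ".txt")
writeLines(capture.output(sessionInfo()), info_file)

# Check the first lines of what was written
head(readLines(info_file), 3)
```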

Citation

BibTeX citation:
@online{taudière2024,
  author = {Taudière, Adrien},
  title = {Bioinformatics Pipeline Summary},
  date = {2024-12-03},
  langid = {en}
}
For attribution, please cite this work as:
Taudière, Adrien. 2024. “Bioinformatics Pipeline Summary.” December 3, 2024.